-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Push down version
filter in the Delta $history
table
#16192
Push down version
filter in the Delta $history
table
#16192
Conversation
f0edb4a
to
78ef37c
Compare
78ef37c
to
f24e14d
Compare
88b453b
to
cf835ab
Compare
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeHistoryTable.java
Outdated
Show resolved
Hide resolved
{ | ||
requireNonNull(startVersion, "identity is null"); | ||
requireNonNull(endVersion, "identity is null"); | ||
verify(startVersion.orElse(0L) <= endVersion.orElse(startVersion.orElse(0L)), "startVersion is greater than endVersion"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a duplicate code from DeltaLakeTransactionLogEntryIterator? we can extract into a method verifyVersion
?
...main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeTransactionLogEntryIterator.java
Outdated
Show resolved
Hide resolved
...main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeTransactionLogEntryIterator.java
Outdated
Show resolved
Hide resolved
...main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeTransactionLogEntryIterator.java
Outdated
Show resolved
Hide resolved
0323ce7
to
389fe6c
Compare
Rebase on top of |
389fe6c
to
bbfbbed
Compare
do you need those .crc files? Wouldn't all the tests work without them ? |
@homar no the files are not needed for the tests to run successfully. I copied the content of the table as it was from Databricks. I see a mixture in the test resources of |
I don't have strong opinion. I just don't think we need them so maybe it would be better not to have unnecessary files. But you don't have to remove them if you don't agree |
1 similar comment
I don't have strong opinion. I just don't think we need them so maybe it would be better not to have unnecessary files. But you don't have to remove them if you don't agree |
bbfbbed
to
9bfc116
Compare
Rebasing on |
9bfc116
to
49d11bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First two commits only
stream(commitInfoEntries) | ||
.map(commitInfoEntry -> getRecord(commitInfoEntry, timeZoneKey)) | ||
.iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than going back and forth to Stream
you can just use Iterables.transform
or Iterators.transform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did change to Iterables.transform
in the 1st commit.
However, in the 2nd commit, I find the stream approach more straightforward to read. I don't see how to get to a similar result without Streams.stream
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're just looking for Iterators.concat
? To replace the Stream flatMap
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeHistoryTable.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeHistoryTable.java
Show resolved
Hide resolved
.../src/main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeTransactionLogIterator.java
Outdated
Show resolved
Hide resolved
.../src/main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeTransactionLogIterator.java
Outdated
Show resolved
Hide resolved
.../src/main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeTransactionLogIterator.java
Outdated
Show resolved
Hide resolved
import static java.util.Objects.requireNonNull; | ||
|
||
public class DeltaLakeTransactionLogIterator | ||
implements Iterator<List<DeltaLakeTransactionLogEntry>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a flat Iterator<DeltaLakeTransactionLogEntry>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flat iterator comes with increased overhead. I created initially such an iterator, but it has rather high complexity for little gain - the transaction log file is read at once.
See draft implementation
https://github.com/trinodb/trino/compare/f0edb4afb7f92815526e121acffbef19382a9558..78ef37c36721b29952f052fb7da9f1a5bb847bb9
if (stopped) { | ||
return Optional.empty(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stopped
seems unnecessary to me, but maybe I'm missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without stopped
we could call hasNext()
several times and this would incur further unnecessary reads in case when the endVersion
is not present. Same thing for the situation when a transaction log in between - e.g. 1,2,3,5,6
- note that 4
is missing - is missing.
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeHistoryTable.java
Outdated
Show resolved
Hide resolved
"00000000000000000006.json", 17, | ||
"_last_checkpoint", 17)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a sense of why these two are so high on these queries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I counted while debugging 9
times the call to io.trino.connector.system.SystemTablesMetadata#checkAndGetTable
while performing a SELECT
from the $history
system metadata table.
Each of this calls eventually causes a subsequent DeltaLakeMetadata#getTableHandle()
call which calls TransactionLogParser#tryReadLastCheckpoint
.
I was planning to create an issue to further investigate whether we could spare some getTableHandle()
calls.
49d11bd
to
903b2e3
Compare
fileSystem, | ||
new Path(metastore.getTableLocation(tableName, session)), | ||
Optional.empty(), | ||
Optional.empty())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these 4 lines be more indented ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't found a better way to highlight the stream. Could you pls suggest a change in this regard?
903b2e3
to
1119dd8
Compare
1119dd8
to
0760302
Compare
CI hit #15187 |
Superseded by #18427 |
Description
Doing the pushdown for the
version
column predicates helps out in avoiding to load unnecessary transaction log files from the storage.As part of the changes done for this feature, the Delta
$history
table avoids materializing all the entries of the table at once by using the same strategy as #16112Additional context and related issues
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: